Deep neural networks (DNNs) have become a standard component in supervised ASR, used in both data-driven feature extraction and acoustic modelling. Supervision is typically obtained from a forced alignment that provides phone class targets, requiring transcriptions and pronunciations. We propose a novel unsupervised DNN-based feature extractor that can be trained without these resources in zero-resource settings. Using unsupervised term discovery, we find pairs of isolated word examples of the same unknown type; these provide weak top-down supervision. For each pair, dynamic programming is used to align the feature frames of the two words. Matching frames are presented as input-output pairs to a deep autoencoder (AE) neural network. Using this AE as feature extractor in a word discrimination task, we achieve 64% relative improvement over a previous state-of-the-art system, 57% improvement relative to a bottom-up trained deep AE, and come to within 23% of a supervised system.
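The alignment step described above can be illustrated with a minimal sketch of dynamic time warping between two word examples, producing the frame pairs that would serve as autoencoder input and target. This is an illustration under stated assumptions, not the authors' implementation: the function names `dtw_align` and `frame_pairs`, the Euclidean frame distance, and the three-way step pattern (diagonal, vertical, horizontal) are all hypothetical choices, since the abstract does not specify them.

```python
import numpy as np


def dtw_align(x, y):
    """Return the (i, j) frame index pairs on a minimum-cost DTW path.

    x has shape (n, d) and y has shape (m, d): one feature vector per frame.
    Assumption: Euclidean frame distance and a symmetric three-way step pattern.
    """
    n, m = len(x), len(y)
    # Distance between every frame of x and every frame of y, shape (n, m).
    dist = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2)

    # Accumulated cost with a padded border so the recursion has a base case.
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
            cost[i, j] = dist[i - 1, j - 1] + best_prev

    # Backtrack from the final cell to the first, preferring the diagonal on ties.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1, j - 1], i - 1, j - 1),
            (cost[i - 1, j], i - 1, j),
            (cost[i, j - 1], i, j - 1),
        ]
        _, i, j = min(steps, key=lambda s: s[0])
    path.reverse()
    return path


def frame_pairs(x, y):
    """Turn one word pair into aligned (input, target) frame matrices.

    Assumption: frames of x are used as inputs and the matched frames of y as
    targets; presenting the pair in the other direction is equally possible.
    """
    idx_x, idx_y = zip(*dtw_align(x, y))
    return x[list(idx_x)], y[list(idx_y)]
```

Each row of the returned input matrix is paired with the same row of the target matrix, so frame pairs pooled over all discovered word pairs could be passed directly to a regression-style training loop for the autoencoder.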